Abstract. Turn-based stochastic games and their important subclass, Markov decision processes (MDPs),
provide models for systems with both probabilistic and nondeterministic behaviors. We consider
turn-based stochastic games with two classical quantitative objectives: discounted-sum and long-run
average objectives. The game models and the quantitative objectives are widely used in probabilistic
verification, planning, optimal inventory control, network protocol and performance analysis. Games
and MDPs that model realistic systems often have very large state spaces, and probabilistic
abstraction techniques are necessary to handle the state-space explosion. The commonly used
full-abstraction techniques do not yield space savings for systems that have many states with similar
value but not necessarily similar transition structure. A semi-abstraction technique, namely
magnifying-lens abstraction (MLA), which clusters states based on value only, disregarding
differences in their transition relation, was proposed for qualitative objectives (reachability and
safety objectives) [8]. In this paper we extend the MLA technique to solve stochastic games with
discounted-sum and long-run average objectives. We present the MLA-based abstraction-refinement
algorithm for stochastic games and MDPs with discounted-sum objectives. For long-run average
objectives, our solution works for all MDPs and a subclass of stochastic games where every state has
the same value.
1 Introduction



A turn-based stochastic game is played on a finite graph with three types of states: in player-1
states, the first player chooses a successor state from a given set of outgoing edges; in player-2
states, the second player chooses a successor state from a given set of outgoing edges; and in
probabilistic states, the successor state is chosen according to a given probability distribution.
The game results in an infinite path through the graph. An important subclass of turn-based
stochastic games is Markov decision processes (MDPs): in MDPs the set of player-2 states is empty.
Turn-based stochastic games and MDPs provide models for the study of dynamic systems that exhibit
both probabilistic and nondeterministic behavior.

Turn-based stochastic games with qualitative objectives such as reachability, safety, and more
general ω-regular objectives have been widely studied in the literature II6I7I2I5I in the context of
verification of probabilistic systems. Many other application scenarios, such as planning, inventory
control, and performance analysis, require the study of turn-based stochastic games with quantitative
objectives 09llll7i7 161 . The two classical quantitative objectives studied in the literature are
discounted-sum (in short, discounted) and long-run average objectives [10,11]. In both objectives a
real-valued reward is assigned to every state. For an infinite path (an infinite sequence of states in
the game graph), the discounted objective assigns a payoff that is the discounted sum of the rewards
that appear in the path, and the long-run average objective assigns the long-run average of the
rewards that appear in the path. Turn-based stochastic games and MDPs with discounted and long-run
average objectives provide an important and powerful framework for studying a wide range of
applications BIOIU .

Turn-based stochastic games and MDPs that model realistic systems typically have very large state
spaces. Therefore the main algorithmic challenge in analysing such models consists of developing
algorithms that work efficiently on large state spaces. In the non-probabilistic setting, abstraction
techniques have been successful in coping with large state spaces p^. By ignoring details not relevant
to the property under study, abstraction makes it possible to answer questions about a system through
the analysis of a smaller, more concise abstract model. The abstraction-refinement techniques for the
non-probabilistic setting do not always have a straightforward extension to probabilistic models. The
commonly used full-abstraction techniques do not yield space savings for systems that have many states
with similar value but not necessarily similar transition structure. A semi-abstraction technique,
namely magnifying-lens abstraction (MLA), was proposed for a subclass of qualitative objectives
(namely, reachability and safety objectives) [8]. MLA is a semi-abstract technique that can cluster
states based on value only and can disregard the differences in their transition relation. MLA is
particularly well suited to problems where there is a notion of locality in the state space, so that
it is useful to cluster states based on values, even though their transition relations may not be
similar. Many inventory, planning and control problems satisfy the locality property and would benefit
from the MLA technique. In the setting of inventory, planning and control problems, quantitative
objectives are more appropriate than qualitative objectives. This provides a strong and practical
motivation for extending the work of [8] to provide an MLA-based solution for turn-based stochastic
games and MDPs with quantitative objectives.

In this paper we extend the MLA technique to solve stochastic games with quantitative objectives. The
MLA technique of [8] works for MDPs and the special class of qualitative objectives, namely
reachability and safety objectives (the model is quantitative with probabilities but the objectives
are qualitative). We present the MLA-based abstraction-refinement algorithm for both stochastic games
and MDPs with discounted objectives. For long-run average objectives, our solution works for all MDPs
and a subclass of stochastic games where every state has the same value. We note that for long-run
average objectives in stochastic games, the same assumption (of all states having the same value) is
required for the relative value iteration algorithm to work lllllOlT . Hence our results generalize
the results of [8] from the subclass of reachability and safety objectives (which are Boolean) to the
general class of discounted and long-run average objectives (which are quantitative). An
abstraction-refinement based technique was proposed in JS] for turn-based stochastic games with
quantitative objectives, but that technique provides neither a useful way to abstract probabilities
nor the space-saving benefit of the MLA-based technique. Thus our algorithms provide space-efficient
and practical algorithmic solutions for a wide class of problems of interest. To demonstrate the
applicability of our algorithms we present a symbolic implementation of our algorithms for MDPs with
discounted objectives. In Section 5 we present many examples to illustrate cases where the MLA-based
solution has a clear advantage over the full-abstraction techniques, and our experimental results show
that the MLA-based technique gives a significant space saving.

2 Preliminaries 

For a finite set $S$, a probability distribution on $S$ is a function $p : S \to [0,1]$ such that
$\sum_{s \in S} p(s) = 1$; we denote the set of probability distributions on $S$ by
$\mathrm{Dist}(S)$. A valuation over a set $S$ is a function $v : S \to \mathbb{R}$ associating a real
number $v(s)$ with every $s \in S$. For $x \in \mathbb{R}$, we denote by $\mathbf{x}$ the valuation
with constant value $x$; for $T \subseteq S$, we indicate by $[T]$ the valuation having value 1 in $T$
and 0 elsewhere. For two valuations $v, u$ on $S$, we define
$\|v - u\| = \sup_{s \in S} |v(s) - u(s)|$.

A partition of a set $S$ is a set $R \subseteq 2^S$ such that $\bigcup_{x \in R} \{s \mid s \in x\} = S$
and $x \cap x' = \emptyset$ for all $x \neq x' \in R$. For $s \in S$ and a partition $R$ of $S$, we
denote by $[s]_R$ the element $x \in R$ with $s \in x$. We say that a partition $R$ is finer than a
partition $R'$ if for any $x \in R$ there exists $x' \in R'$ such that $x \subseteq x'$.
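The partition notions above translate directly into a few set operations. The following is a minimal sketch; the function names `is_partition`, `region_of` and `is_finer` are illustrative choices made here, and regions are represented as Python frozensets (an assumption of this sketch, not something the text prescribes).

```python
from itertools import chain

def is_partition(regions, states):
    """True if the regions are pairwise disjoint and together cover `states`."""
    members = list(chain.from_iterable(regions))
    return len(members) == len(set(members)) and set(members) == set(states)

def region_of(s, regions):
    """[s]_R: the unique region of the partition that contains s."""
    return next(x for x in regions if s in x)

def is_finer(R, R_prime):
    """R is finer than R' if every region of R lies inside some region of R'."""
    return all(any(x <= y for y in R_prime) for x in R)

S = {0, 1, 2, 3}
R_coarse = [frozenset({0, 1}), frozenset({2, 3})]
R_fine = [frozenset({0}), frozenset({1}), frozenset({2, 3})]
```

Here `R_fine` refines `R_coarse`, but not the other way round.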

We consider the class of turn-based probabilistic games and its important subclass of Markov decision 
processes (MDPs). 

Game graphs. A turn-based probabilistic game graph (2½-player game graph)
$G = ((S,E), (S_1, S_2, S_P), \delta)$ consists of a directed graph $(S,E)$, a partition
$(S_1, S_2, S_P)$ of the finite set $S$ of states, and a probabilistic transition function
$\delta : S_P \to \mathrm{Dist}(S)$, where $\mathrm{Dist}(S)$ denotes the set of probability
distributions over the state space $S$. The states in $S_1$ are the player-1 states, where player 1
decides the successor state; the states in $S_2$ are the player-2 states, where player 2 decides the
successor state; and the states in $S_P$ are the probabilistic states, where the successor state is
chosen according to the probabilistic transition function $\delta$. We assume that for $s \in S_P$ and
$t \in S$, we have $(s,t) \in E$ iff $\delta(s)(t) > 0$, and we often write $\delta(s,t)$ for
$\delta(s)(t)$. For technical convenience we assume that every state in the graph $(S,E)$ has at least
one outgoing edge. For a state $s \in S$, we write $E(s)$ to denote the set
$\{t \in S \mid (s,t) \in E\}$ of possible successors. For a partition $R$ of $S$, a region
$r_2 \in R$ is called a successor of a region $r_1 \in R$ if at least one concrete state in $r_1$ has
non-zero probability to reach concrete state(s) in $r_2$. The Markov decision processes (1½-player
game graphs) are the special case of the 2½-player game graphs with $S_1 = \emptyset$ or
$S_2 = \emptyset$. We refer to the MDPs with $S_2 = \emptyset$ as player-1 MDPs, and to the MDPs with
$S_1 = \emptyset$ as player-2 MDPs.

Footnote: Thus the assumption that all states have the same value is necessary even for classical
value iteration algorithms and even without abstraction, and hence cannot be avoided in our setting
with abstraction.
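As an illustration of the definition, a game graph can be stored as three state sets, a successor map and a transition table. This is only a sketch: the class name `GameGraph`, its field names and the tiny three-state example are hypothetical and chosen here for illustration.

```python
from dataclasses import dataclass

@dataclass
class GameGraph:
    player1: set   # S_1
    player2: set   # S_2
    prob: set      # S_P
    edges: dict    # E(s): state -> set of successor states
    delta: dict    # delta[s][t] = transition probability, for s in S_P

    @property
    def states(self):
        return self.player1 | self.player2 | self.prob

    def validate(self):
        S = self.states
        # (S_1, S_2, S_P) must be pairwise disjoint.
        assert len(S) == len(self.player1) + len(self.player2) + len(self.prob)
        for s in S:
            # Every state has at least one outgoing edge, and edges stay in S.
            assert self.edges[s] and self.edges[s] <= S
        for s in self.prob:
            dist = self.delta[s]
            assert abs(sum(dist.values()) - 1.0) < 1e-9
            # (s, t) is an edge iff delta(s)(t) > 0.
            assert {t for t, p in dist.items() if p > 0} == self.edges[s]
        return True

    def is_mdp(self):
        # MDPs are the special case S_1 = {} or S_2 = {}.
        return not self.player1 or not self.player2

G = GameGraph(
    player1={"a"}, player2={"b"}, prob={"c"},
    edges={"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "c"}},
    delta={"c": {"a": 0.25, "c": 0.75}},
)
```

Calling `G.validate()` checks the structural assumptions stated above; `G.is_mdp()` is false here because both players own a state.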

Plays and strategies. An infinite path, or a play, of the game graph $G$ is an infinite sequence
$\omega = (s_0, s_1, s_2, \ldots)$ of states such that $(s_k, s_{k+1}) \in E$ for all
$k \in \mathbb{N}$. We write $\Omega$ for the set of all plays, and for a state $s \in S$, we write
$\Omega_s \subseteq \Omega$ for the set of plays that start from the state $s$. A strategy for player 1
is a function $\sigma : S^* \cdot S_1 \to \mathrm{Dist}(S)$ that assigns a probability distribution to
all finite sequences $w \in S^* \cdot S_1$ of states ending in a player-1 state (the sequence
represents a prefix of a play). Player 1 follows the strategy $\sigma$ if in each player-1 move, given
that the current history of the game is $w \in S^* \cdot S_1$, she chooses the next state according to
the probability distribution $\sigma(w)$. A strategy must prescribe only available moves, i.e., for
all $w \in S^*$, $s \in S_1$, and $t \in S$, if $\sigma(w \cdot s)(t) > 0$, then $(s,t) \in E$. The
strategies for player 2 are defined analogously. We denote by $\Sigma$ and $\Pi$ the set of all
strategies for player 1 and player 2, respectively.

Once a starting state $s \in S$ and strategies $\sigma \in \Sigma$ and $\pi \in \Pi$ for the two
players are fixed, the outcome of the game is a random walk $\omega_s^{\sigma,\pi}$ for which the
probabilities of events are uniquely defined, where an event $\mathcal{A} \subseteq \Omega$ is a
measurable set of plays. For a state $s \in S$ and an event $\mathcal{A} \subseteq \Omega$, we write
$\Pr_s^{\sigma,\pi}(\mathcal{A})$ for the probability that a play belongs to $\mathcal{A}$ if the game
starts from the state $s$ and the players follow the strategies $\sigma$ and $\pi$, respectively. For
a measurable function $f : \Omega \to \mathbb{R}$ we denote by $\mathbb{E}_s^{\sigma,\pi}[f]$ the
expectation of the function $f$ under the probability measure $\Pr_s^{\sigma,\pi}(\cdot)$.

Strategies that do not use randomization are called pure. A player-1 strategy $\sigma$ is pure if for
all $w \in S^*$ and $s \in S_1$, there is a state $t \in S$ such that $\sigma(w \cdot s)(t) = 1$. A
memoryless player-1 strategy does not depend on the history of the play but only on the current state;
i.e., for all $w, w' \in S^*$ and for all $s \in S_1$ we have
$\sigma(w \cdot s) = \sigma(w' \cdot s)$. A memoryless strategy can be represented as a function
$\sigma : S_1 \to \mathrm{Dist}(S)$. A pure memoryless strategy is a strategy that is both pure and
memoryless. A pure memoryless strategy for player 1 can be represented as a function
$\sigma : S_1 \to S$. We denote by $\Sigma^{PM}$ the set of pure memoryless strategies for player 1.
The pure memoryless player-2 strategies $\Pi^{PM}$ are defined analogously.

Quantitative objectives. A quantitative objective is specified as a measurable function
$f : \Omega \to \mathbb{R}$. We consider zero-sum games, i.e., games that are strictly competitive. In
zero-sum games the objectives of the players are functions $f$ and $-f$, respectively. We consider two
classical quantitative objectives: discounted-sum objectives and long-run average (mean-payoff)
objectives. Their definitions are as follows.

- Discounted objectives. Let $r : S \to \mathbb{R}_{\geq 0}$ be a real-valued reward function that
assigns to every state $s$ the reward $r(s)$, and let $0 < \beta < 1$ be a discount factor. The
discounted objective Disc assigns to every play the $\beta$-discounted sum of the rewards that appear
in the play. Formally, for a play $\omega = (s_0, s_1, s_2, s_3, \ldots)$ we have
$\mathrm{Disc}(\beta, r)(\omega) = \sum_{i=0}^{\infty} \beta^i \cdot r(s_i)$.

- Long-run average objectives. Let $r : S \to \mathbb{R}_{\geq 0}$ be a real-valued reward function
that assigns to every state $s$ the reward $r(s)$. The long-run average objective LimAvg assigns to
every play the long-run average of the rewards that appear in the play. Formally, for a play
$\omega = (s_1, s_2, s_3, \ldots)$ we have
$\mathrm{LimAvg}(r)(\omega) = \liminf_{T \to \infty} \frac{1}{T} \cdot \sum_{i=1}^{T} r(s_i)$.
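Both objectives can be evaluated exactly on an ultimately periodic play, i.e. a finite prefix followed by a cycle repeated forever. The sketch below assumes that restricted form of play; the helper names and the three-state reward table are hypothetical. The discounted sum uses the geometric-series closed form, and the long-run average of such a play is simply the mean reward on the cycle.

```python
def disc_lasso(prefix, cycle, r, beta):
    """Discounted sum of the play prefix . cycle^omega, via a geometric series."""
    head = sum(beta**i * r[s] for i, s in enumerate(prefix))
    loop = sum(beta**j * r[s] for j, s in enumerate(cycle))
    return head + beta**len(prefix) * loop / (1 - beta**len(cycle))

def disc_truncated(prefix, cycle, r, beta, n):
    """First n terms of the discounted sum, to compare with the closed form."""
    play = [prefix[i] if i < len(prefix) else cycle[(i - len(prefix)) % len(cycle)]
            for i in range(n)]
    return sum(beta**i * r[s] for i, s in enumerate(play))

def limavg_lasso(cycle, r):
    """Long-run average of any play that eventually repeats `cycle` forever."""
    return sum(r[s] for s in cycle) / len(cycle)

r = {"a": 4.0, "b": 0.0, "c": 2.0}
beta = 0.5
exact = disc_lasso(["a"], ["b", "c"], r, beta)       # play a (b c)^omega
approx = disc_truncated(["a"], ["b", "c"], r, beta, 60)
avg = limavg_lasso(["b", "c"], r)
```

The prefix affects the discounted value but not the long-run average, which is why `limavg_lasso` ignores it.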

Values and optimal strategies. Given a game graph $G$, and quantitative objectives specified as
measurable functions $f$ and $-f$ for player 1 and player 2, respectively, we define the value
functions $\mathrm{Val}_1$ and $\mathrm{Val}_2$ for the players 1 and 2, respectively, as the following
functions from the state space $S$ to the set $\mathbb{R}$ of reals: for all states $s \in S$, let

$$\mathrm{Val}_1^G(f)(s) = \sup_{\sigma \in \Sigma} \inf_{\pi \in \Pi} \mathbb{E}_s^{\sigma,\pi}[f];
\qquad
\mathrm{Val}_2^G(-f)(s) = \sup_{\pi \in \Pi} \inf_{\sigma \in \Sigma} \mathbb{E}_s^{\sigma,\pi}[-f].$$

In other words, the value $\mathrm{Val}_1^G(f)(s)$ gives the maximal expectation with which player 1
can achieve her objective $f$ from state $s$, and analogously for player 2. The strategies that achieve
the values are called optimal: a strategy $\sigma$ for player 1 is optimal from the state $s$ for the
objective $f$ if $\mathrm{Val}_1^G(f)(s) = \inf_{\pi \in \Pi} \mathbb{E}_s^{\sigma,\pi}[f]$. The
optimal strategies for player 2 are defined analogously. We now state the classical memoryless
determinacy results for 2½-player games with discounted and long-run average objectives.

Theorem 1 (Quantitative determinacy [10,12]). For all 2½-player game graphs $G$, the following
assertions hold.

- For all reward functions $r : S \to \mathbb{R}_{\geq 0}$, for all $0 < \beta < 1$, and all states
$s \in S$, we have

$$\mathrm{Val}_1^G(\mathrm{Disc}(\beta, r))(s) + \mathrm{Val}_2^G(\mathrm{Disc}(\beta, -r))(s) = 0;$$

$$\mathrm{Val}_1^G(\mathrm{LimAvg}(r))(s) + \mathrm{Val}_2^G(\mathrm{LimAvg}(-r))(s) = 0.$$

- Pure memoryless optimal strategies exist for both players from all states for discounted and
long-run average objectives.

We now present the definition of the predecessor operator Pre. The operator Pre is an important
operator that is used in many classical algorithms to solve 2½-player games with discounted and
long-run average objectives.

Definition 1 (The predecessor operator (Pre)). Given a game graph
$G = ((S,E), (S_1, S_2, S_P), \delta)$, the predecessor operator Pre takes a valuation
$v : S \to \mathbb{R}_{\geq 0}$ and returns a valuation $\mathrm{Pre}(v) : S \to \mathbb{R}_{\geq 0}$
defined as follows: for every state $s \in S$ we have

$$\mathrm{Pre}(v)(s) = \begin{cases}
\max_{t \in E(s)} v(t) & s \in S_1 \\
\min_{t \in E(s)} v(t) & s \in S_2 \\
\sum_{t \in S} \delta(s,t) \cdot v(t) & s \in S_P.
\end{cases}$$
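The three cases of Pre can be sketched in a few lines. The encoding below is an assumption made for illustration: `owner[s]` is `1`, `2` or `"P"` for player-1, player-2 and probabilistic states, `succ[s]` lists the successors $E(s)$, and `delta[s][t]` holds the transition probabilities of probabilistic states.

```python
def pre(v, owner, succ, delta):
    """One application of Pre to a valuation v (dict: state -> float)."""
    out = {}
    for s, who in owner.items():
        if who == 1:      # player-1 state: best successor value
            out[s] = max(v[t] for t in succ[s])
        elif who == 2:    # player-2 state: worst successor value
            out[s] = min(v[t] for t in succ[s])
        else:             # probabilistic state: expectation under delta
            out[s] = sum(delta[s][t] * v[t] for t in succ[s])
    return out

owner = {"a": 1, "b": 2, "c": "P"}
succ = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "c"]}
delta = {"c": {"a": 0.25, "c": 0.75}}
v = {"a": 8.0, "b": 2.0, "c": 4.0}
result = pre(v, owner, succ, delta)
```

For this valuation, state `a` takes the larger of its successor values, `b` the smaller, and `c` the weighted average 0.25 · 8 + 0.75 · 4.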

3 MLA for Discounted Objectives 

In this section we present algorithmic solutions for 2½-player games and MDPs with discounted
objectives.

Classical Algorithms. We present the algorithms to solve turn-based stochastic games with discounted
objectives.

Theorem 2 ( IIIOIIII ). Given a turn-based stochastic game graph $G$, with a reward function
$r : S \to \mathbb{R}_{\geq 0}$ and a discount factor $0 < \beta < 1$, the following assertions hold.

1. (Value iteration). Consider the sequence of valuations $v_0, v_1, v_2, \ldots$ as follows: let
$v_0 = 0$ and for all $i \geq 0$ and $s \in S$ we have

$$v_{i+1}(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{Pre}(v_i)(s).$$

The sequence $(v_i)_{i \geq 0}$ converges monotonically to $\mathrm{Val}_1^G(\mathrm{Disc}(\beta, r))$.

2. (Fixpoint solution). There exists a valuation $v^*$ that is the unique fixpoint of the function
$f(v)(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{Pre}(v)(s)$, i.e., for all $s \in S$ we have

$$v^*(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{Pre}(v^*)(s)$$

and we have $v^* = \mathrm{Val}_1^G(\mathrm{Disc}(\beta, r))$.
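The value iteration of Theorem 2 can be sketched as follows. This is a minimal illustration, assuming the same hypothetical encoding as before (`owner`, `succ`, `delta`) and a stopping tolerance `eps` that is not part of the theorem itself. The recurrence is taken verbatim from the theorem, including the $(1-\beta)$ weight on the reward; with that weight every iterate is a weighted average of rewards and therefore stays between 0 and the largest reward.

```python
def pre(v, owner, succ, delta):
    """Predecessor operator: max / min / expectation over successors."""
    out = {}
    for s, who in owner.items():
        if who == 1:
            out[s] = max(v[t] for t in succ[s])
        elif who == 2:
            out[s] = min(v[t] for t in succ[s])
        else:
            out[s] = sum(delta[s][t] * v[t] for t in succ[s])
    return out

def discounted_value_iteration(owner, succ, delta, r, beta, eps=1e-10):
    """Iterate v_{i+1}(s) = (1 - beta) * r(s) + beta * Pre(v_i)(s) from v_0 = 0."""
    v = {s: 0.0 for s in owner}
    while True:
        p = pre(v, owner, succ, delta)
        nxt = {s: (1 - beta) * r[s] + beta * p[s] for s in owner}
        # Stop once two consecutive iterates are within eps in the sup norm.
        if max(abs(nxt[s] - v[s]) for s in owner) <= eps:
            return nxt
        v = nxt

owner = {"a": 1, "b": 2, "c": "P"}
succ = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "c"]}
delta = {"c": {"a": 0.25, "c": 0.75}}
r = {"a": 0.0, "b": 1.0, "c": 4.0}
beta = 0.9
v_star = discounted_value_iteration(owner, succ, delta, r, beta)
```

The returned valuation approximately satisfies the fixpoint equation of item 2, which is a convenient sanity check for any implementation.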

The classical algorithms. The classical algorithms for solving turn-based stochastic games are based
on the result of Theorem 2 and are as follows.

1. We obtain the sequence of valuations $(v_i)_{i \geq 0}$ as given by Theorem 2 by iterating over the
valuations, and the sequence converges (w.r.t. an error tolerance $\varepsilon_{\mathrm{float}}$) to
the desired value of the game.

2. The fixpoint $v^*$ that gives the desired value of the game can be obtained by solving optimization
problems: if the game graph is an MDP, then it can be obtained from the solution of a
linear-programming problem [13], and for general turn-based stochastic games it can be obtained as a
solution of a quadratic programming problem [14].

Abstract properties for upper and lower bounds of value functions. We first present certain abstract
properties of functions that can be used to obtain upper and lower bounds on the value of a stochastic
game with a discounted objective. Later we will present concrete functions that satisfy the abstract
properties and can be implemented by the magnifying-lens abstraction technique.



Theorem 3. Let $G$ be a turn-based stochastic game graph with reward function
$r : S \to \mathbb{R}_{\geq 0}$ and discount factor $\beta$. Let $M = \max_{s \in S} |r(s)|$ and let
$Q = \frac{M}{1 - \beta}$. Consider the function $f$ on valuations such that

$$f(v)(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{Pre}(v)(s).$$

Let $f_+$ and $f_-$ be two functions on valuations that satisfy the following conditions:

1. $f_+$ and $f_-$ are monotonic;

2. for all valuations $v$ we have $f_-(v) \leq f(v) \leq f_+(v)$;

3. for all valuations $v$ bounded by $Q$ (i.e., for all $s \in S$ we have $-Q \leq v(s) \leq Q$) we
have $-Q \leq f_-(v) \leq f_+(v) \leq Q$.

Then there exist least fixpoints $v_+^*$ and $v_-^*$ of $f_+$ and $f_-$, and
$v_-^* \leq \mathrm{Val}_1^G(\mathrm{Disc}(\beta, r)) \leq v_+^*$.

In the following we will use the magnifying-lens abstraction technique to define functions $f_+$ and
$f_-$ that satisfy the properties of the above theorem. This will allow us to obtain an efficient
solution of turn-based stochastic games with abstraction techniques.

Magnifying Lens Abstraction Algorithm. Magnifying-lens abstraction (MLA) is a semi-abstract technique
that can cluster states based on value only, disregarding differences in their transition relation.
Let $v^*$ be the discounted-sum valuation over $S$ that is to be computed. Given a desired accuracy
$\varepsilon_{\mathrm{abs}} > 0$, MLA computes upper and lower bounds for $v^*$, spaced less than
$\varepsilon_{\mathrm{abs}}$ apart.

Algorithm Sketch. The MLA algorithm is shown in Algorithm 1. The algorithm has parameters $G$,
$\beta$, $r$, and errors $\varepsilon_{\mathrm{abs}} > 0$, $\varepsilon_{\mathrm{float}} \geq 0$.
Parameter $\varepsilon_{\mathrm{abs}}$ indicates the allowed maximum difference between the lower and
upper bounds returned by MLA. MLA starts from an initial partition (set of regions) $R$ of $S$. The
initial partition $R$ is obtained either from the user or from the property. Statement 2 initializes
the valuations $u^-$ and $u^+$ to 0 since discounted sums are computed as least fixpoints. MLA computes
the lower and upper bounds as valuations $u^-$ and $u^+$ over $R$ by the GlobalValIter algorithm
(Algorithm 2). Global iteration, when implemented as a value iteration (Algorithm 2), takes an extra
parameter $\varepsilon_{\mathrm{float}} > 0$. Parameter $\varepsilon_{\mathrm{float}}$, the stopping
parameter of classical value iteration, specifies the degree of precision to which the global value
iteration should converge. For accurate global iterations, we can set the parameter
$\varepsilon_{\mathrm{float}}$ to 0. The partition is refined until the difference between $u^-$ and
$u^+$, for all regions, is below a specified threshold.



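Algorithm 1 MLA($G$, $\beta$, $r$, $\varepsilon_{\mathrm{abs}}$, $\varepsilon_{\mathrm{float}}$) Magnifying-Lens Abstraction

Input : game $G$, discount factor $\beta$,
        reward function $r : S \to \mathbb{R}_{\geq 0}$,
        errors $\varepsilon_{\mathrm{abs}} > 0$, $\varepsilon_{\mathrm{float}} \geq 0$
Output : final partition $R$, valuations $u^-, u^+ : R \to \mathbb{R}_{\geq 0}$

1.  $R$ := some initial partition.
2.  $u^-$ := 0; $u^+$ := 0
3.  loop
4.      $u^+$ := $u^-$
5.      $u^+$ := GlobalValIter($G$, $R$, $u^+$, $\beta$, $r$, max, $\varepsilon_{\mathrm{float}}$)
6.      $u^-$ := GlobalValIter($G$, $R$, $u^-$, $\beta$, $r$, min, $\varepsilon_{\mathrm{float}}$)
7.      if $\|u^+ - u^-\| \geq \varepsilon_{\mathrm{abs}}$
8.      then $R, u^-, u^+$ := SplitRegions($R$, $u^-$, $u^+$, $\varepsilon_{\mathrm{abs}}$)
9.      else return $R, u^-, u^+$
10.     end if
11. end loop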









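Algorithm 2 GlobalValIter($G$, $R$, $u$, $\beta$, $r$, $h$, $\varepsilon_{\mathrm{float}}$) Global Value Iteration

Input : game $G$, partition $R$, valuation $u : R \to \mathbb{R}_{\geq 0}$,
        discount factor $\beta$, reward function $r : S \to \mathbb{R}_{\geq 0}$,
        $h \in \{\max, \min\}$, error $\varepsilon_{\mathrm{float}} \geq 0$
Output : valuation $u : R \to \mathbb{R}_{\geq 0}$

1. repeat
2.     $\hat{u}$ := $u$
3.     for $x \in R$ do
4.         $u(x)$ := MagIter($G$, $R$, $x$, $u$, $\beta$, $r$, $h$, $\varepsilon_{\mathrm{float}}$)
5.     end for
6. until $\|u - \hat{u}\| \leq \varepsilon_{\mathrm{float}}$
7. return $u$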



Global Value Iteration (GlobalValIter). To compute $u^-$ (resp. $u^+$), GlobalValIter considers each
region $x \in R$ in turn, and performs a magnified iteration (MI): it improves the bound $u^-(x)$
(resp. $u^+(x)$) by solving the sub-game on the concrete states in $x$.

Magnified Iteration (MagIter). The goal of the magnified iteration algorithm is to either (a) iterate
functions $f_+$ and $f_-$ with the properties of Theorem 3 or (b) obtain fixpoints of $f_+$ and $f_-$.
To obtain the desired functions we define an auxiliary function $g$ and a magnified predecessor
operator MPre, and then present the magnifying-lens abstraction implementation of MPre.

Definition 2. Given a game graph $G = ((S,E), (S_1, S_2, S_P), \delta)$, two states $s, t \in S$, a
partition $R$, a valuation $v : S \to \mathbb{R}_{\geq 0}$, and $h \in \{\max, \min\}$, we define the
auxiliary function $g$ as follows:

$$g(s, h, R, v)(t) = \begin{cases}
v(t) & t \in [s]_R \\
h\{v(t') \mid t' \in [t]_R\} & t \notin [s]_R.
\end{cases}$$

The function $g$ works as follows: given two states $s$ and $t$, a valuation $v$, a partition $R$ and
a function $h \in \{\max, \min\}$, it returns the value $v(t)$ if $s$ and $t$ belong to the same
region; otherwise it returns the result of applying $h$ to the values $v(t')$ of the states $t'$ that
belong to the same region as $t$. We now define the magnified predecessor operator MPre, which is
similar to Pre but applies the function $g$ to obtain values.

Definition 3 (Magnified Predecessor Operator (MPre)). Given a game graph
$G = ((S,E), (S_1, S_2, S_P), \delta)$, a partition $R$, a valuation $v : S \to \mathbb{R}_{\geq 0}$,
and $h \in \{\max, \min\}$, we define the valuation
$\mathrm{MPre}(h, v, R) : S \to \mathbb{R}_{\geq 0}$ as follows: let $z$ represent $(s, h, R, v)$, then
for all states $s \in S$, we have

$$\mathrm{MPre}(h, v, R)(s) = \begin{cases}
\max_{t \in E(s)} g(z)(t) & s \in S_1 \\
\min_{t \in E(s)} g(z)(t) & s \in S_2 \\
\sum_{t \in E(s)} \delta(s,t) \cdot g(z)(t) & s \in S_P.
\end{cases}$$

Lemma 1 (Properties of MPre). Given a game graph $G$, for all partitions $R$ and all
$h \in \{\max, \min\}$, we have

1. $\mathrm{MPre}(h, v, R)$ is monotonic, i.e., for two valuations $v, v'$, if $v \leq v'$, then
$\mathrm{MPre}(h, v, R) \leq \mathrm{MPre}(h, v', R)$.

2. If a valuation $v$ is bounded by $Q$, then $\mathrm{MPre}(h, v, R)$ is also bounded by $Q$.

3. If $h$ is max, then $\mathrm{Pre}(v) \leq \mathrm{MPre}(h, v, R)$, and if $h$ is min, then
$\mathrm{Pre}(v) \geq \mathrm{MPre}(h, v, R)$.

The above lemma shows that MPre with $h$ as max and min, respectively, satisfies all the properties of
$f_+$ and $f_-$ of Theorem 3, respectively. Hence we obtain the following lemma.

Lemma 2. Given a game $G$, for all partitions $R$, and all valuations $v : S \to \mathbb{R}_{\geq 0}$,
consider the following functions:

$$l_+(v)(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{MPre}(\max, v, R)(s);$$

$$l_-(v)(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{MPre}(\min, v, R)(s).$$

Then there exist least fixpoints $v_+^*$ and $v_-^*$ of $l_+$ and $l_-$, respectively, such that
$v_-^* \leq \mathrm{Val}_1^G(\mathrm{Disc}(\beta, r)) \leq v_+^*$.
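To make the operator MPre and the sandwich property of Lemma 1 (item 3) concrete, here is a small sketch. The four-state game, the partition into two regions and the helper names (`region_of`, `step`, `pre`, `mpre`) are all hypothetical; `h` is passed as Python's built-in `max` or `min`.

```python
def region_of(s, R):
    """[s]_R: the region of partition R containing state s."""
    return next(x for x in R if s in x)

def step(s, g, owner, succ, delta):
    """Max, min or expectation of g over the successors of s."""
    if owner[s] == 1:
        return max(g(t) for t in succ[s])
    if owner[s] == 2:
        return min(g(t) for t in succ[s])
    return sum(delta[s][t] * g(t) for t in succ[s])

def pre(v, owner, succ, delta):
    return {s: step(s, lambda t: v[t], owner, succ, delta) for s in owner}

def mpre(h, v, R, owner, succ, delta):
    """MPre(h, v, R): successors outside [s]_R are seen through h over their region."""
    out = {}
    for s in owner:
        home = region_of(s, R)
        g = lambda t: v[t] if t in home else h(v[w] for w in region_of(t, R))
        out[s] = step(s, g, owner, succ, delta)
    return out

owner = {"a": 1, "b": 2, "c": "P", "d": "P"}
succ = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["d"]}
delta = {"c": {"a": 0.5, "d": 0.5}, "d": {"d": 1.0}}
R = [frozenset("ab"), frozenset("cd")]
v = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 6.0}

p = pre(v, owner, succ, delta)
hi = mpre(max, v, R, owner, succ, delta)
lo = mpre(min, v, R, owner, succ, delta)
```

For example, state `a` sees its successor `c` (in the other region) as the regional maximum 6 under `max` and as the regional minimum 2 under `min`, while the concrete value of `c` is 2.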

Magnified Iteration Implementation. We now present the implementation details of the magnified
iteration technique. The operator MPre takes as input a valuation over the whole state space $S$, and
returns a valuation over the whole state space. In the magnifying-lens abstraction implementation, our
goal is to save space, and to operate on valuations that are not defined on the whole state space. To
achieve this goal, for a given region $x \in R$, we define a new operator $\mathrm{MPre}_x$ and present
its relation with MPre.

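Definition 4 ($\mathrm{MPre}_x$). Given a game graph $G = ((S,E), (S_1, S_2, S_P), \delta)$, a
partition $R$, a region $x \in R$, and valuations $u : R \to \mathbb{R}_{\geq 0}$ and
$v_x : x \to \mathbb{R}_{\geq 0}$, we define the valuation
$\mathrm{MPre}_x(v_x, R, u) : x \to \mathbb{R}_{\geq 0}$ as follows: for all states $s \in x$, we have

$$\mathrm{MPre}_x(v_x, R, u)(s) = \begin{cases}
\max_{t \in E(s)} g(y)(t) & s \in S_1 \\
\min_{t \in E(s)} g(y)(t) & s \in S_2 \\
\sum_{t \in E(s)} \delta(s,t) \cdot g(y)(t) & s \in S_P
\end{cases}$$

where $y$ represents $(s, R, v_x, u)$. The auxiliary function $g$ can be defined as follows:

$$g(s, R, v_x, u)(t) = \begin{cases}
v_x(t) & t \in [s]_R \\
u([t]_R) & t \notin [s]_R.
\end{cases}$$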

Observe that $\mathrm{MPre}_x$ takes a valuation on the states of a region $x$ (instead of a valuation
on the whole state space), and a valuation on the partition of the state space (and hence requires
much less memory than a valuation on the whole state space). The following lemma establishes the
relation of MPre and $\mathrm{MPre}_x$. Hence we always implement the MPre operator as
$\mathrm{MPre}_x$.



Lemma 3 (Relation of MPre and $\mathrm{MPre}_x$). Given a game graph $G$, for all partitions $R$ and
all $h \in \{\max, \min\}$, for all valuations $v : S \to \mathbb{R}_{\geq 0}$, for all $x \in R$, let
$v_x : x \to \mathbb{R}_{\geq 0}$ be a valuation such that $v_x(s) = v(s)$ for all $s \in x$, and let
$u : R \to \mathbb{R}_{\geq 0}$ be a valuation such that $u(x) = h\{v(s) \mid s \in x\}$. Then we have
$\mathrm{MPre}_x(v_x, R, u)(s) = \mathrm{MPre}(h, v, R)(s)$ for all $s \in x$.
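The equality stated in Lemma 3 can be checked numerically on a small instance. In the sketch below, the game, the partition and the helper names are hypothetical; `mpre` needs the full valuation `v`, whereas `mpre_x` only needs the local valuation `v_x` on the magnified region and one number per region in `u`.

```python
owner = {"a": 1, "b": 2, "c": "P", "d": "P"}
succ = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["d"]}
delta = {"c": {"a": 0.5, "d": 0.5}, "d": {"d": 1.0}}

def region_of(s, R):
    return next(x for x in R if s in x)

def step(s, g):
    """Max, min or expectation of g over the successors of s."""
    if owner[s] == 1:
        return max(g(t) for t in succ[s])
    if owner[s] == 2:
        return min(g(t) for t in succ[s])
    return sum(delta[s][t] * g(t) for t in succ[s])

def mpre(h, v, R, s):
    """MPre(h, v, R)(s), computed from a valuation on the whole state space."""
    home = region_of(s, R)
    return step(s, lambda t: v[t] if t in home else h(v[w] for w in region_of(t, R)))

def mpre_x(v_x, R, u, s):
    """MPre_x(v_x, R, u)(s): local values inside the region, abstract values outside."""
    home = region_of(s, R)
    return step(s, lambda t: v_x[t] if t in home else u[region_of(t, R)])

R = [frozenset("ab"), frozenset("cd")]
v = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 6.0}

agree = []
for h in (max, min):
    u = {x: h(v[s] for s in x) for x in R}      # u(x) = h{ v(s) | s in x }
    for x in R:
        v_x = {s: v[s] for s in x}              # v restricted to the region x
        for s in x:
            agree.append(abs(mpre_x(v_x, R, u, s) - mpre(h, v, R, s)) < 1e-12)
```

Every entry of `agree` is true here, which is what the lemma predicts when `u` is built from `v` with the same `h`.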

Magnified iteration, which involves the $\mathrm{MPre}_x$ implementation of MPre using the
magnifying-lens abstraction technique, can be done in two ways, like the classical algorithms. We
present them below.

Solution of fixpoint by optimization. The fixpoints of the functions that provide upper and lower
bounds on the value using MPre and $h$ as max and min can be obtained by solving optimization
problems. We present the fixpoint solution for the case when $h$ is max; the case when $h$ is min is
similar. Given a partition $R$, we have two valuation variables $u^+ : R \to \mathbb{R}_{\geq 0}$ and
$v : S \to \mathbb{R}$, and we denote by $v_x$ the valuation variable $v$ restricted to a region
$x \in R$. We have a set of global constraints that specifies that in every region $x$ the value
$u^+(x)$ is the maximum value of $v_x(s)$ for all $s \in x$; i.e., we have the following constraints:

$$u^+(x) = \max_{s \in x} v_x(s) \quad \text{for all } x \in R.$$

Along with the above constraints we have local constraints for every region $x \in R$, which specify
that $v_x(s)$ should satisfy the fixpoint constraints for $\mathrm{MPre}_x$. In other words, for every
region $x \in R$ we have the following set of local constraints:

$$v_x(s) = (1 - \beta) \cdot r(s) + \beta \cdot \mathrm{MPre}_x(v_x, R, u^+)(s) \quad \text{for all } s \in x.$$

Thus instead of solving one huge optimization problem, using $\mathrm{MPre}_x$ we decompose the
optimization problem into many smaller sub-problems with independent sub-parts. Thus the solution is
more space efficient and can be achieved faster in practice. Also notice that the solution by
optimization to obtain the fixpoint corresponds to the solution of magnified iteration (MagIter) with
$\varepsilon_{\mathrm{float}} = 0$.

Theorem 4 (Correctness of Approximation). Given a turn-based stochastic game
$G = ((S,E), (S_1, S_2, S_P), \delta)$, a discount factor $\beta$, a reward function $r$, and error
bounds $\varepsilon_{\mathrm{abs}} > 0$ and $\varepsilon_{\mathrm{float}} = 0$, the following
assertions hold: let $(R, u^+, u^-) = \mathrm{MLA}(G, \beta, r, \varepsilon_{\mathrm{abs}}, 0)$, then

1. for all $s \in S$ we have
$u^-([s]_R) \leq \mathrm{Val}_1^G(\mathrm{Disc}(\beta, r))(s) \leq u^+([s]_R)$; and

2. for all $x \in R$ we have $u^+(x) - u^-(x) < \varepsilon_{\mathrm{abs}}$.

Value iteration implementation of MagIter. The Magnified Iteration (MagIter) step can also be
implemented as a value iteration. When MagIter is implemented as a value iteration, we require that
$\varepsilon_{\mathrm{float}} > 0$. The parameter $\varepsilon_{\mathrm{float}}$ specifies the degree
of precision to which the local, magnified value iteration should converge. Algorithm 3 gives the
formal description of the procedure.

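Theorem 5 (Termination and Correctness). Given a turn-based stochastic game
$G = ((S,E), (S_1, S_2, S_P), \delta)$, a discount factor $\beta$, and a reward function
$r : S \to \mathbb{R}_{\geq 0}$, for all error bounds $\varepsilon_{\mathrm{abs}} > 0$, the following
assertions hold.

1. For all $\varepsilon_{\mathrm{float}} > 0$, the call
$\mathrm{MLA}(G, \beta, r, \varepsilon_{\mathrm{abs}}, \varepsilon_{\mathrm{float}})$ terminates.

2. There exists an error bound $\varepsilon_{\mathrm{float}}$ such that if
$(R, u^+, u^-) = \mathrm{MLA}(G, \beta, r, \varepsilon_{\mathrm{abs}}, \varepsilon_{\mathrm{float}})$,
then

(a) for all $s \in S$ we have
$u^-([s]_R) \leq \mathrm{Val}_1^G(\mathrm{Disc}(\beta, r))(s) \leq u^+([s]_R)$; and

(b) for all $x \in R$ we have $u^+(x) - u^-(x) < \varepsilon_{\mathrm{abs}}$.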

Adaptive refinement step (SplitRegions). The step SplitRegions is obtained by adaptive refinement of
regions with large imprecision. We denote the imprecision of a region $x$ by
$\Delta(x) = u^+(x) - u^-(x)$. MLA adaptively refines a partition $R$ by splitting all regions $x$
having $\Delta(x) \geq \varepsilon_{\mathrm{abs}}$. The refinement scheme is simple and easy to
implement. Thus a call to SplitRegions($R$, $u^-$, $u^+$, $\varepsilon_{\mathrm{abs}}$) returns a
triple $R, u^-, u^+$, consisting of the new partition with its lower and upper bounds for the
valuation. Like [8], we also tried other refinement heuristics, but none of them gave strictly better
results.



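Algorithm 3 MagIter($G$, $R$, $x$, $u$, $\beta$, $r$, $h$, $\varepsilon_{\mathrm{float}}$)

Input : game $G$, partition $R$, a region $x \in R$,
        valuation $u : R \to \mathbb{R}_{\geq 0}$, discount factor $\beta$,
        reward function $r : S \to \mathbb{R}_{\geq 0}$,
        $h \in \{\max, \min\}$, error $\varepsilon_{\mathrm{float}}$
Output : a value $u(x) \in \mathbb{R}_{\geq 0}$
Data structure : $v, \hat{v}$ : valuations over $x$

1. for $s \in x$ do $v(s)$ := $u(x)$ end for
2. repeat
3.     $\hat{v}$ := $v$
4.     for $s \in x$ do
5.         $v(s)$ := $(1 - \beta) \cdot r(s) + \beta \cdot \mathrm{MPre}_x(v, R, u)(s)$
6.     end for
7. until $\|v - \hat{v}\| \leq \varepsilon_{\mathrm{float}}$
8. return $h\{v(s) \mid s \in x\}$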
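The three procedures (MLA, GlobalValIter, MagIter) fit together as in the following sketch. It is an illustration under several assumptions that are not taken from the text: regions are split into two halves by sorted state order, child regions inherit the parent's bounds, a `max_sweeps` cap guards the global loop, the loop also stops when no region with a large gap can be split further, and the eight-state example game is hypothetical. States `0..3` all have value 1 and states `4..7` all have value 5, although their transition structure differs.

```python
def region_of(s, R):
    return next(x for x in R if s in x)

def mag_iter(G, R, x, u, beta, r, h, eps):
    """Local value iteration on region x; outside states are seen through u."""
    owner, succ, delta = G
    v = {s: u[x] for s in x}
    look = lambda t: v[t] if t in x else u[region_of(t, R)]
    while True:
        change = 0.0
        for s in x:
            if owner[s] == 1:
                m = max(look(t) for t in succ[s])
            elif owner[s] == 2:
                m = min(look(t) for t in succ[s])
            else:
                m = sum(delta[s][t] * look(t) for t in succ[s])
            new = (1 - beta) * r[s] + beta * m
            change = max(change, abs(new - v[s]))
            v[s] = new                      # in-place update of the local valuation
        if change <= eps:
            return h(v.values())

def global_val_iter(G, R, u, beta, r, h, eps, max_sweeps=1000):
    u = dict(u)
    for _ in range(max_sweeps):             # cap is a safeguard added in this sketch
        old = dict(u)
        for x in R:
            u[x] = mag_iter(G, R, x, u, beta, r, h, eps)
        if max(abs(u[x] - old[x]) for x in R) <= eps:
            break
    return u

def split_regions(R, lo, hi, eps_abs):
    """Assumed heuristic: halve every splittable region whose gap is too large."""
    new_R, new_lo, new_hi = [], {}, {}
    for x in R:
        parts = [x]
        if hi[x] - lo[x] >= eps_abs and len(x) > 1:
            m = sorted(x)
            parts = [frozenset(m[:len(m) // 2]), frozenset(m[len(m) // 2:])]
        for p in parts:
            new_R.append(p)
            new_lo[p], new_hi[p] = lo[x], hi[x]
    return new_R, new_lo, new_hi

def mla(G, beta, r, eps_abs, eps_float, R):
    lo = {x: 0.0 for x in R}
    while True:
        hi = global_val_iter(G, R, lo, beta, r, max, eps_float)   # starts from u^-
        lo = global_val_iter(G, R, lo, beta, r, min, eps_float)
        wide = [x for x in R if hi[x] - lo[x] >= eps_abs]
        if not any(len(x) > 1 for x in wide):
            return R, lo, hi
        R, lo, hi = split_regions(R, lo, hi, eps_abs)

owner = {0: 1, 1: 2, 2: "P", 3: 1, 4: 2, 5: "P", 6: 1, 7: 2}
succ = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [0],
        4: [5, 6], 5: [4, 7], 6: [7, 4], 7: [4]}
delta = {2: {0: 0.5, 3: 0.5}, 5: {4: 0.25, 7: 0.75}}
r = {s: (1.0 if s < 4 else 5.0) for s in owner}
R_final, lo, hi = mla((owner, succ, delta), 0.5, r, 0.1, 1e-9,
                      [frozenset(range(8))])
```

On this example a single refinement step suffices: the final partition has two regions, even though the eight states have pairwise different transition behavior.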



Space Savings. For the value iteration algorithm, the space requirement is equal to the size of the
state space $|S|$, the domain of $v$. For MLA, the space requirement is equal to the maximum value of
$2 \cdot |R| + \max_{x \in R} |x|$. The expression gives the maximum space required to store the
valuations $u^+$, $u^-$, as well as the values $v$ for the largest magnified region. Since
$\max_{x \in R} |x| \geq |S|/|R|$, the space complexity of the algorithm is (lower) bounded by a
square-root function $\sqrt{8 \cdot |S|}$. However, this bound is provided for the concrete
implementation.
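The square-root bound follows from a short calculation. Writing $n = |S|$ and $k = |R|$ (shorthand introduced here only for this derivation) and treating $k$ as a real variable:

$$2k + \max_{x \in R} |x| \;\geq\; 2k + \frac{n}{k}, \qquad
\frac{d}{dk}\Big(2k + \frac{n}{k}\Big) = 2 - \frac{n}{k^2} = 0 \iff k = \sqrt{n/2},$$

$$2\sqrt{n/2} + \frac{n}{\sqrt{n/2}} = \sqrt{2n} + \sqrt{2n} = \sqrt{8n}.$$

So the space requirement is smallest when the partition has about $\sqrt{|S|/2}$ regions of equal size, and it can never drop below $\sqrt{8 \cdot |S|}$.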

4 MLA for Long-run Average Objectives 

In this section we present the magnifying-lens abstraction solution for a class of stochastic games
with long-run average objectives. We first describe the efficient classical solution and then present
our magnifying-lens abstraction solution.

Value iteration for long-run average objectives. A value iteration algorithm can be used to compute
the long-run average value as follows: for a state $s$ we compute by value iteration the maximum
expected sum of the rewards for $k$ steps starting from $s$, and we denote this sum by $S(k, s)$. Then
the value of the state $s$ is $\lim_{k \to \infty} \frac{S(k,s)}{k}$. However, this technique is not
very practical, as $S(k, s) \to \infty$ and $S(k, s)$ diverges fast towards infinity. Hence computing
$S(k, s)$ and dividing by $k$ is computationally expensive. This problem can be alleviated by the
relative value iteration algorithm, which subtracts a number $c \in \mathbb{R}$ in each iteration.
This technique trims the values for all states simultaneously, and it is an efficient way to compute
values in games that have the same value in all states.

Lemma 4. Consider a turn-based stochastic game graph $G = ((S,E), (S_1, S_2, S_P), \delta)$ with a
reward function $r : S \to \mathbb{R}_{\geq 0}$. For a real number $c$, consider a sequence of
valuations $(v_i)_{i \geq 0}$ as follows: let $v_0 = c$ and for all $i \geq 0$ and $s \in S$ we have
$v_{i+1}(s) = r(s) - c + \mathrm{Pre}(v_i)(s)$. If there exists a real value $v^*$ such that for all
$s \in S$ we have $\mathrm{Val}_1^G(\mathrm{LimAvg}(r))(s) = v^*$, then the following conditions hold:

1. The sequence $(v_i)_{i \geq 0}$ diverges to $+\infty$ iff $c < v^*$.

2. The sequence $(v_i)_{i \geq 0}$ diverges to $-\infty$ iff $c > v^*$.

The relative value iteration algorithm chooses a real value $c$, and then adjusts the value of $c$
adaptively depending on whether the sequence $(v_i)_{i \geq 0}$, given the chosen value $c$, diverges
to $+\infty$ or $-\infty$; otherwise the chosen real number $c$ is the value of the game.
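The adaptive choice of $c$ can be sketched as a binary search driven by the sign of the divergence in Lemma 4. Divergence cannot be observed in finitely many steps, so the sketch below substitutes a heuristic that is an assumption of this example: it runs at most `steps` iterations and reports divergence once all values have passed `+bound` or `-bound`. The three-state MDP is hypothetical; every state has the same long-run average value, 1.2.

```python
def pre(v, owner, succ, delta):
    out = {}
    for s, who in owner.items():
        if who == 1:
            out[s] = max(v[t] for t in succ[s])
        elif who == 2:
            out[s] = min(v[t] for t in succ[s])
        else:
            out[s] = sum(delta[s][t] * v[t] for t in succ[s])
    return out

def divergence(owner, succ, delta, r, c, steps=5000, bound=50.0):
    """Iterate v_{i+1}(s) = r(s) - c + Pre(v_i)(s) from v_0 = c.
    Returns +1 / -1 if all values pass +bound / -bound, and 0 otherwise."""
    v = {s: c for s in owner}
    for _ in range(steps):
        p = pre(v, owner, succ, delta)
        v = {s: r[s] - c + p[s] for s in owner}
        if min(v.values()) > bound:
            return 1
        if max(v.values()) < -bound:
            return -1
    return 0

def relative_value_search(owner, succ, delta, r, tol=1e-2):
    lo, hi = min(r.values()), max(r.values())   # the value lies between the extreme rewards
    while hi - lo > tol:
        c = (lo + hi) / 2
        d = divergence(owner, succ, delta, r, c)
        if d > 0:
            lo = c          # sequence grows: c is below the value
        elif d < 0:
            hi = c          # sequence falls: c is above the value
        else:
            return c        # no divergence detected within the budget
    return (lo + hi) / 2

owner = {"a": 1, "b": "P", "c": 1}
succ = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
delta = {"b": {"a": 0.5, "c": 0.5}}
r = {"a": 0.0, "b": 3.0, "c": 0.0}
estimate = relative_value_search(owner, succ, delta, r)
```

In this MDP the best choice at `a` is to move to `b`; one round trip collects reward 3 over an expected 2.5 steps, giving the value 3 / 2.5 = 1.2, and the search returns a number close to it.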

MLA for Stochastic Games. We develop the magnifying-lens abstraction solution for stochastic games
under the assumption that there is a uniform value $v^*$ such that every state has the same value
$v^*$. Later we will consider the question of presenting criteria for its existence. For MDPs we will
present our solution for all MDPs (without the assumption of existence of a uniform value). The
magnifying-lens abstraction solution for stochastic games with long-run average objectives is based on
the following lemma.

Lemma 5 (Magnified Relative Value Iteration). Given a game graph $G$, for all partitions $R$, consider
two sequences of valuations $(v_i^+)_{i \geq 0}$ and $(v_i^-)_{i \geq 0}$ as follows: let
$v_0^+ = v_0^- = c$ and for all $i \geq 0$ and $s \in S$ we have:

$$v_{i+1}^+(s) = r(s) - c + \mathrm{MPre}(\max, v_i^+, R)(s)$$

$$v_{i+1}^-(s) = r(s) - c + \mathrm{MPre}(\min, v_i^-, R)(s)$$

If there exists a real value $v^*$ such that for all $s \in S$ we have
$\mathrm{Val}_1^G(\mathrm{LimAvg}(r))(s) = v^*$, then the following conditions hold:

1. If the sequence $(v_i^-)_{i \geq 0}$ diverges to $+\infty$, then $c < v^*$.

2. If the sequence $(v_i^+)_{i \geq 0}$ diverges to $-\infty$, then $c > v^*$.

Lemma 4 relates the divergence of the sequence $(v_i)_{i \geq 0}$ for a chosen $c$ and the value of the game
in both directions (the iff conditions are necessary and sufficient), whereas Lemma 5 (with the magnified
pre operator) relates the value of $c$ and the divergence of the sequence in one direction only: if a sequence
diverges in the given direction (i.e., $(v_i^+)_{i \geq 0}$ to $-\infty$ or $(v_i^-)_{i \geq 0}$ to $+\infty$),
then we can conclude the relation between the value of the game and the chosen value $c$. We now present the
algorithm that is based on Lemma 5.

Algorithm Sketch. Algorithm 4 approximates the long-run average value of a stochastic game $G$ in which
every state has the same value. The algorithm uses Lemma 5 to obtain upper and lower bounds on the value
of the game by a dichotomic (binary) search. The search space is bounded by the interval $[c^-, c^+]$, where
$c^+$ and $c^-$ denote an upper and a lower bound on the value of the game, respectively. The initial value
of $c^+$ (resp. $c^-$) is the maximum (resp. minimum) reward value of the game. Each iteration of the binary
search starts by setting $c$ to the mid-point of the interval. If, with the chosen value of $c$, the sequence
$(v_i^+)_{i \geq 0}$ diverges to $-\infty$, then $c$ is an upper bound on the value of the game, and $c^+$ is
set (decreased) to $c$. Similarly, if, with the chosen value of $c$, the sequence $(v_i^-)_{i \geq 0}$ diverges
to $+\infty$, then $c$ is a lower bound on the value of the game, and $c^-$ is set (increased) to $c$. The
procedure CheckDivergence returns a verdict on the divergence of a sequence by computing $(k + 1)$ elements
of the sequence. The verdicts $+$ and $-$ denote divergence to $+\infty$ and $-\infty$, respectively. The
verdict $?$ means that either (a) $k$ elements of the sequence are not enough to detect the divergence, or
(b) the sequence may not diverge to $+\infty$ or $-\infty$. If CheckDivergence returns $+$ for $d^-$, then
$c^-$ is set to $c$, and if CheckDivergence returns $-$ for $d^+$, then $c^+$ is set to $c$. Otherwise we do
not have enough information to update $c^+$ or $c^-$, and the algorithm refines the partition $R$ by invoking
the SplitRegions procedure. The imprecision of a region $x \in R$ is denoted by $\Delta(x) = v^+(x) - v^-(x)$.
The procedure SplitRegions splits a number (precisely $\mathit{ratio} \cdot |R|$) of high-imprecision regions.

Algorithm 4 MLALongRun($G$, $r$, $\varepsilon_{abs}$, $k$) Magnifying-Lens Abstraction

Input : game $G$, reward function $r : S \to \mathbb{R}_{\geq 0}$,
        error bound $\varepsilon_{abs} > 0$,
        maximum number of iterations $k$ : integer,
        a real value $\mathit{ratio} \in [0, 1]$
Output : final partition $R$

1. $R$ := some initial partition
2. $c^+ := \max_{s \in S} r(s)$, $c^- := \min_{s \in S} r(s)$
3. while $(c^+ - c^-) > \varepsilon_{abs}$ do
4.     $c := (c^+ + c^-)/2$
5.     $d^+, v^+$ := CheckDivergence($G$, $r$, $R$, $c$, max, $k$)
6.     $d^-, v^-$ := CheckDivergence($G$, $r$, $R$, $c$, min, $k$)
7.     if $d^- = +$ then $c^- := c$
8.     else if $d^+ = -$ then $c^+ := c$
9.     else $R$ := SplitRegions($R$, $v^+$, $v^-$, $\varepsilon_{abs}$, $\mathit{ratio}$)
10.    end if
11. end while
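
The control flow of Algorithm 4 can be sketched in Python as follows. This is only a skeleton under stated assumptions: the divergence checks and the refinement step are passed in as callables, and the `ToyAbstraction` class is a hypothetical stand-in (not the magnified iteration of the paper) whose verdicts become sharper each time it is refined. The `max_rounds` guard is also an addition for this sketch.

```python
def mla_long_run_skeleton(check_min, check_max, refine,
                          c_lo, c_hi, eps_abs, max_rounds=1000):
    """Binary search on c, refining the abstraction when no verdict is given.

    check_min(c) and check_max(c) return "+", "-" or "?"; refine() improves
    the abstraction. Returns the final bounds (c_lo, c_hi).
    """
    rounds = 0
    while c_hi - c_lo > eps_abs and rounds < max_rounds:
        rounds += 1
        c = (c_lo + c_hi) / 2.0
        if check_min(c) == "+":      # lower sequence diverges to +infinity
            c_lo = c
        elif check_max(c) == "-":    # upper sequence diverges to -infinity
            c_hi = c
        else:                        # no verdict: split regions and retry
            refine()
    return c_lo, c_hi


class ToyAbstraction:
    """Hypothetical oracle: verdicts are available only outside a band of
    half-width `width` around the hidden value; refining halves the band."""

    def __init__(self, true_value, width):
        self.true_value = true_value
        self.width = width

    def check_min(self, c):
        return "+" if c < self.true_value - self.width else "?"

    def check_max(self, c):
        return "-" if c > self.true_value + self.width else "?"

    def refine(self):
        self.width /= 2.0


if __name__ == "__main__":
    toy = ToyAbstraction(true_value=1.3, width=0.5)
    print(mla_long_run_skeleton(toy.check_min, toy.check_max, toy.refine,
                                0.0, 4.0, 0.01))
```

In this toy setting the lower bound is only raised when the oracle certifies $c$ below the hidden value, and the upper bound is only lowered when it certifies $c$ above it, so the hidden value always stays inside the returned interval.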

Detecting divergence to $+\infty$ and $-\infty$. Algorithm 5 shows the procedure CheckDivergence, which
detects the divergence of a sequence starting from a given $c$ within a given number of iterations $k$. If
the value of every state increases beyond $c$ after $k$ iterations, then the sequence diverges to $+\infty$
(the procedure returns $+$), and if the value of every state decreases below $c$, then the sequence diverges
to $-\infty$ (the procedure returns $-$). Otherwise, divergence to $+\infty$ or $-\infty$ cannot be concluded,
and the procedure returns the verdict $?$. The algorithm also returns the $(k + 1)$-th valuation of the
sequence starting from $c$.



Algorithm 5 CheckDivergence($G$, $r$, $R$, $c$, $h$, $k$)

Input : game $G$, reward function $r : S \to \mathbb{R}_{\geq 0}$,
        a partition $R$, a chosen value $c \in \mathbb{R}_{\geq 0}$,
        $h \in \{\max, \min\}$,
        maximum number of iterations $k$ : integer
Output : enum $d \in \{+, -, ?\}$, valuation $v : R \to \mathbb{R}_{\geq 0}$

1. $v_0 := c$, $d := ?$
2. for $i = 0$ to $k$ do
3.     for each $x \in R$
4.         $v_{i+1}(x)$ := MagIter2($G$, $R$, $x$, $v_i$, $r$, $h$, $k$)
5.     end for
6. end for
7. if $(\min_{x \in R} v_{k+1}(x) > c)$ then $d := +$
8. if $(\max_{x \in R} v_{k+1}(x) < c)$ then $d := -$
9. return $d$, $v_{k+1}$



Magnified Iteration (Long-Run Average version). Algorithm 6 provides the details of the magnified iteration
of a region $x \in R$. The algorithm performs $k$ iterations of value iteration over the states of the region
$x$, and summarizes the resulting values into a single value. As in the discounted case, we use the MPre^
operator for the magnifying lens abstraction implementation.

Theorem 6 (Termination and Correctness). Let $G = ((S, E), (S_1, S_2, S_P), \delta)$ be a turn-based stochastic
game with a reward function $r : S \to \mathbb{R}_{\geq 0}$ such that there exists $v^* \in \mathbb{R}_{\geq 0}$
and for all $s \in S$ we have $\mathit{Val}_1^G(\mathit{LimAvg}(r))(s) = v^*$. For all error bounds
$\varepsilon_{abs} > 0$, the following assertions hold.

1. The call MLALongRun($G$, $r$, $\varepsilon_{abs}$, $k$) terminates.

2. There exists a positive integer $k$ such that if $(R, u^+, u^-)$ = MLALongRun($G$, $r$, $\varepsilon_{abs}$, $k$), then
   (a) for all $s \in S$ we have $u^-([s]_R) \leq \mathit{Val}_1^G(\mathit{LimAvg}(r))(s) \leq u^+([s]_R)$; and
   (b) for all $x \in R$ we have $u^+(x) - u^-(x) \leq \varepsilon_{abs}$.

Algorithm 6 MagIter2($G$, $R$, $x$, $u$, $r$, $h$, $k$)

Input : game $G$, partition $R$, a region $x \in R$,
        valuation $u : R \to \mathbb{R}_{\geq 0}$,
        reward function $r : S \to \mathbb{R}_{\geq 0}$, $h \in \{\max, \min\}$,
        maximum number of iterations $k$ : integer
Output : a value $u(x) : \mathbb{R}_{\geq 0}$

1. for $s \in x$ do $v_0(s) := u(x)$ end for
2. for $i = 0$ to $k$ do
3.     for $s \in x$ do
4.         $v_{i+1}(s) := r(s) - c +$ MPre^$(v_i, R, u)(s)$
5.     end for
6. end for
7. return $h\{v_{k+1}(s) \mid s \in x\}$

Ensuring uniform value. The following theorem presents a sufficient condition to ensure a uniform value in
a turn-based stochastic game (i.e., the same value everywhere). The condition can be checked in polynomial
time using algorithms for solving turn-based stochastic reachability games with qualitative winning
criteria [5]. For a state $t$ we denote by $\Diamond t$ the set of paths that reach $t$.

Theorem 7. Consider a turn-based stochastic game graph $G$ with a reward function
$r : S \to \mathbb{R}_{\geq 0}$. Suppose there exists a state $t$ such that the following conditions hold:

1. for all $s \in S$ there exists a player 1 strategy $\sigma$ such that against all player 2 strategies $\pi$
   we have $\Pr_s^{\sigma,\pi}(\Diamond t) > 0$; and

2. for all $s \in S$ there exists a player 2 strategy $\pi$ such that against all player 1 strategies $\sigma$
   we have $\Pr_s^{\sigma,\pi}(\Diamond t) > 0$.

Then there exists a real value $v^*$ such that for all $s \in S$ we have
$\mathit{Val}_1^G(\mathit{LimAvg}(r))(s) = v^*$.

Proof. Suppose condition 1 holds. Then, by the existence of pure memoryless optimal strategies in turn-based
stochastic reachability games [5], there is a pure memoryless strategy $\sigma^*$ witnessing that from all
states $s$ the state $t$ is reached with positive probability against all player 2 strategies. Hence if we
fix any pure memoryless counter-strategy $\pi$ for player 2, the closed recurrent set must contain $t$. From
the existence of pure memoryless optimal strategies for turn-based stochastic games with reachability and
safety objectives, it follows that player 1 can ensure that from all states $s$ the state $t$ is reached with
probability 1. Hence for all states $s$ we have
$\mathit{Val}_1^G(\mathit{LimAvg}(r))(s) \geq \mathit{Val}_1^G(\mathit{LimAvg}(r))(t)$. Similarly, if condition 2
holds, then player 2 can ensure that $t$ is reached with probability 1 from all states $s$, and hence for all
states $s$ we have $\mathit{Val}_1^G(\mathit{LimAvg}(r))(s) \leq \mathit{Val}_1^G(\mathit{LimAvg}(r))(t)$. Hence
$v^* = \mathit{Val}_1^G(\mathit{LimAvg}(r))(t)$ is the witness real value to show that the claim holds. ∎
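
The two conditions of Theorem 7 are positive-probability reachability conditions, and these can be checked by a graph fixpoint in which probabilistic states behave like states of the player who wants to reach $t$. The Python sketch below illustrates this; the encoding (`succ` maps each state to its successors, `owner` maps each state to `1`, `2` or `"P"`), the function names and the example graphs are assumptions of this sketch, and every state is assumed to have at least one successor.

```python
def positive_reach(succ, owner, target, proponent):
    """States from which `proponent` (1 or 2) has a strategy that reaches
    `target` with positive probability against every opponent strategy."""
    win = {target}
    changed = True
    while changed:
        changed = False
        for s, nxt in succ.items():
            if s in win:
                continue
            if owner[s] in (proponent, "P"):
                # One good successor suffices (chosen, or taken w.p. > 0).
                ok = any(t in win for t in nxt)
            else:
                # The opponent must be unable to avoid the winning set.
                ok = all(t in win for t in nxt)
            if ok:
                win.add(s)
                changed = True
    return win


def satisfies_theorem7_conditions(succ, owner, target):
    """Check conditions 1 and 2 for the candidate state `target`."""
    states = set(succ)
    return (positive_reach(succ, owner, target, 1) == states
            and positive_reach(succ, owner, target, 2) == states)


# Illustrative graph in which both conditions hold for t.
SUCC_OK = {"a": ["p"], "b": ["p"], "p": ["t", "a"], "t": ["a", "b"]}
OWNER_OK = {"a": 1, "b": 2, "p": "P", "t": 1}

# Adding a player-2 sink z breaks both conditions (t is unreachable from z).
SUCC_BAD = {"a": ["p"], "b": ["p", "z"], "p": ["t", "a"],
            "t": ["a", "b"], "z": ["z"]}
OWNER_BAD = {"a": 1, "b": 2, "p": "P", "t": 1, "z": 2}
```

The fixpoint adds at least one state per pass, so it runs in time polynomial in the size of the graph, in line with the polynomial-time claim above.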

MLA for MDPs. Now we present the magnifying lens abstraction solution for all MDPs with long-run 
average objectives (i.e., the solution works for MDPs such that values at different states may be different). 
The main idea relies on the end component decomposition of an MDP. 

Definition 5 (End component). Given an MDP $G = ((S, E), (S_1, S_P), \delta)$, a set $C$ of states is an end
component if the following conditions hold: (a) the set $C$ is a strongly connected component in the graph
induced by $(S, E)$; and (b) for all probabilistic states $s \in C \cap S_P$, all outgoing edges of $s$ are
contained in $C$, i.e., $E(s) \subseteq C$. An end component $C$ is maximal if for any end component $C'$ we
have either (a) $C' \subseteq C$ or (b) $C' \cap C = \emptyset$.

The following theorem states that given an MDP with a long-run average objective, if we consider the 
sub-game graph induced by an end component C, then all states in C would have the same value. 

Theorem 8 (Existence of uniform value for MDPs). Let $G = ((S, E), (S_1, S_P), \delta)$ be an MDP with a
reward function $r$. Consider an end component $C$ in $G$ and the sub-game graph $G \upharpoonright C$ induced
by $C$. Then there exists a real value $v^*$ such that for all $s \in C$ we have
$\mathit{Val}_1^{G \upharpoonright C}(\mathit{LimAvg}(r))(s) = v^*$.

It follows from Theorem 8 that if we consider the sub-game graph induced by an end component of
an MDP, then the condition of uniform value (all states having the same value) is satisfied. It follows from
the results of [7, 6] that in an MDP, for all strategies, with probability 1 the set of states visited infinitely
often is an end component. Hence a pure memoryless optimal strategy consists in reaching the correct end
component, and then playing optimally in the end component. Thus we obtain the following magnifying lens
abstraction algorithm for MDPs; the algorithm consists of the following steps:

1. the MDP is decomposed into its maximal end components (this can be achieved in quadratic time); 

2. in the sub-game graph induced by a maximal end component we approximate the values by the general
algorithm (since the uniform value condition is satisfied, we can apply the general algorithm for
stochastic games);

3. once the values in every maximal end component are approximated we can collapse every maximal 
end component as a single state and obtain an MDP with no non-trivial end component (every end 
component is a single state end component), and then compute the values by an algorithm that computes 
the maximal value that can be reached in an MDP with no non-trivial end components. 
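
Step 1 of the list above can be sketched in Python as follows. The sketch uses the standard reading of an end component (strongly connected within itself, with all edges of probabilistic states staying inside), a simple reachability-based routine for strongly connected components that favours brevity over the quadratic bound, and a hypothetical encoding (`succ` maps a state to its successors, `is_prob` marks probabilistic states). Steps 2 and 3 are not covered.

```python
def _reach(start, nodes, edges):
    """States of `nodes` reachable from `start` along `edges`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def _sccs(nodes, succ):
    """Strongly connected components of the subgraph induced by `nodes`."""
    pred = {u: [] for u in nodes}
    for u in nodes:
        for v in succ[u]:
            if v in nodes:
                pred[v].append(u)
    remaining, out = set(nodes), []
    while remaining:
        s = next(iter(remaining))
        comp = _reach(s, remaining, succ) & _reach(s, remaining, pred)
        out.append(comp)
        remaining -= comp
    return out


def maximal_end_components(succ, is_prob):
    """Return the maximal end components as a list of frozensets."""
    mecs, work = [], [set(succ)]
    while work:
        cand = work.pop()
        changed = True
        while changed:  # prune states that cannot belong to an end component
            changed = False
            for s in list(cand):
                inside = [t for t in succ[s] if t in cand]
                leaves = len(inside) < len(succ[s])
                if not inside or (is_prob[s] and leaves):
                    cand.discard(s)
                    changed = True
        if not cand:
            continue
        comps = _sccs(cand, succ)
        if len(comps) == 1:   # closed and strongly connected
            mecs.append(frozenset(cand))
        else:                 # recurse on each smaller component
            work.extend(comps)
    return mecs


# Illustrative MDP: b and c are probabilistic, a and d are player-1 states.
SUCC = {"a": ["b", "c"], "b": ["a"], "c": ["c", "d"], "d": ["d"]}
IS_PROB = {"a": False, "b": True, "c": True, "d": False}
```

On the illustrative MDP the state `c` belongs to no end component, because its edge to `d` leaves every strongly connected set containing `c`.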

5 Examples and Experimental Results 

In this section, we provide examples and case studies on MDP models with large state spaces and the
locality property. In our implementation we only consider MDPs because common probabilistic model checkers
(like PRISM) only support MDPs. In future work we will consider implementations for stochastic games.
Our examples show that the value-based abstraction methods perform better than the transition-based
abstraction methods for MDPs with the locality property, although for some other case studies the algorithm
may provide worse performance. We first present the examples, then our symbolic implementation of the MLA
algorithms for MDPs with discounted objectives, and finally our experimental results.



                                                                               non-MLA                   MLA
Example              Parameters                     States  Transitions    Nodes Time (s)    Nodes Time (s)  Regions

Planning             n=256, m=40                    65,537      265,419    3,981       28    1,658       37      330
                     n=512, m=40                   262,145    1,063,603   12,420      106    3,324      191    1,121
                     n=1024, m=50                1,048,577    4,211,564   15,365      616    4,596      883    1,670

Auto Inventory       n_max=2,047, t_max=2,047    4,194,304   20,965,376   12,719       65    1,171      209       99
                     n_max=2,047, t_max=4,095    8,388,608   41,930,752   12,676      114    1,095      259       99
                     n_max=4,095, t_max=4,095   16,777,216   83,873,792   25,434      287    6,293      606       99

Machine Replacement  n=1023, t_m=1023            1,047,552    3,141,632    6,051       17      419       64       63
                     n=2047, t_m=2047            4,192,256   12,574,720   12,185       47      960      152       64
                     n=4095, t_m=4095           16,773,120   50,315,264   24,461      141      960      364       65

Network Protocol     M=7, t_m=2,047              1,781,760    7,157,760      328        5      227       11      245
                     M=15, t_m=2,047             7,745,536   31,045,632      369        8      267       24      647
                     M=15, t_m=4,095            15,491,072   62,091,264      369       17      267       33      647

Fig. 1. Experimental results: Symbolic discounted MLA, compared to discounted value iteration



Planning: We consider an MDP that models the movement of a robot in a two-dimensional ($n \times n$) grid.
The grid contains $m$ mines. The robot at position $(x, y)$ can choose to move forward, backward, left
or right to reach the positions $(x + 1, y)$, $(x - 1, y)$, $(x, y - 1)$ or $(x, y + 1)$, respectively. However,
with probability $p$ the robot does not reach the desired position and is destroyed by the explosion of a
mine, in which case it reaches a special state called the sink state. The probability $p$ is a function of
the distances from the $m$ mines. The robot spends power for its movement and collects rewards (i.e.,
recharges) associated with the chargeable states when it visits them. In this example, the state space is
two-dimensional, and every state has transitions only to neighbouring states in the state space (i.e., the
model has the locality property). The robot needs to explore the grid points in an intelligent manner such
that it spends minimum energy. The property of interest is to maximize either the discounted reward or the
long-run average reward of the robot.



Automobile Inventory: In this example we model the inventory of an automobile company. Let $n$ denote
the current number of items in the inventory, where $0 \leq n \leq n_{max}$ and $n_{max}$ denotes the maximum
capacity of the inventory. Let $t$ denote the age of the inventory in months, where $0 \leq t \leq t_{max}$ and
$t_{max}$ is the total lifetime of the inventory. We assume that an unsold car in the inventory becomes cheaper
every year by a discount factor $\beta$. Every year the company decides whether to manufacture a predefined
number $nc$ of new cars. We assume that the number of cars sold per month, denoted by $sold$, follows
a uniform probability distribution. We also assume that the value of $sold$ can vary within a small range
$[sold_{min}, sold_{max}]$, and these two constants can be obtained from car-sale statistics. The reward
function is obtained from the cost (negative reward) of manufacturing and the price (positive reward) of
selling a car. The state space of the model is defined as $S = (n, t)$, where $n$ denotes the number of cars
and $t$ denotes the current month. The inventory example has the locality property, since each state $(n, t)$
has transitions to the nearby states ($(n - sold, t + 1)$, $(n, t + 1)$ or $(n + nc, t + 1)$). After each fiscal
year, the company computes the optimal value of the inventory, keeping the discount factor in mind. The
property of interest is the "optimal discounted sum" of the car inventory.

Machine Replacement: In this example we model the machine replacement problem. The machine can be in
$n$ different working states, from $0$ to $n - 1$. The state value $0$ denotes that the machine is not working
and the state value $n - 1$ denotes that the machine is new. The time is denoted by the variable $t$ and
ranges between $0$ and a predefined maximum value $t_m$. The machine can be replaced at any time, and the new
machine costs money (we assume that the machine replacement does not add any reward). The machine can
get more work done when it is in a better state (a higher state value), and hence earns more money. The
property of interest is the "optimal discounted value" of the machine.

Network Protocol: Let us assume that $n$ computers follow a simplified version of the Ethernet protocol to
send a predefined number $(M - 1)$ of packets to a shared channel. Let $t$ denote the time elapsed since the
start of the protocol, where $0 \leq t \leq t_{max}$ and $t_{max}$ is the time-out limit of the protocol.
The state space of the model can be given as a tuple $S = (pk_1, pk_2, \ldots, pk_n)$, where
$0 \leq pk_i \leq M$ denotes the number of packets sent from computer $i$. If computer $i$ sends one more
packet to the channel at time $t$, then the $i$-th component of the state changes to $\min\{(pk_i + 1), M\}$.
However, two or more computers can send packets to the channel in the same time frame (a collision), in which
case the packets are lost (the state of the MDP does not change). After the collision, each computer waits
for a random amount of time before sending again. Each computer checks whether the channel is busy in time
frame $t$. If the channel is busy at frame $t$, the computer does not send packets at frame $(t + 1)$.
Otherwise, the computer has two actions: either (1) send at frame $(t + 1)$, or (2) do not send. The waiting
time for the next packet is decided by the stations following a probability distribution. Since the packet
numbers are sequential, the MDP model has the locality property. The average throughput of the shared channel
is measured by the percentage of packets sent without a collision. We are interested in computing the
efficiency of the protocol via the average or discounted throughput property.

MTBDD-based Symbolic Implementation in PRISM: We have implemented both versions (with
and without MLA) of the symbolic discounted algorithms within the probabilistic model checker PRISM [11].
We used the MTBDD engine of PRISM, since (a) it is generally the best-performing engine for MDPs; and
(b) it is the only one that can scale to the size of models we are aiming towards. The current examples with
quantitative objectives cannot be handled directly by PRISM, and hence we have added a new functionality
for discounted reward computation to PRISM. The initial partitions are picked based on the following
choice. Internally, every integer variable with range $l$ is converted into $\log_2(l)$ binary variables. If
the program has $k$ binary variables, then we pick $k/2$ as the initial level of abstraction. We have tried
two types of partitioning procedures, as proposed in [15].

Results: The table above summarizes the results for all case studies (with discount factor 0.9,
$\varepsilon_{abs} = 0.01$ and $\varepsilon_{float} = 0.0001$). The first two columns show the name and
parameters of the MDP model. The third and fourth columns give the number of states and transitions of each
model, respectively. The remaining columns show the performance of analyzing the MDPs using both versions of
the discounted algorithms. In both cases, we give the total time required in seconds and the peak MTBDD node
count. For MLA, we also show the final number of generated regions. Our results show that the MLA algorithm
leads to significant space savings, and space is the real bottleneck in the analysis of large MDPs. The
number of regions increases with the size of the state space; however, the increase is linear or constant in
these examples. The MTBDD node-count columns show that the symbolic iterations of value iteration involve
the whole state space, and hence its peak node count is higher. The MLA algorithm performs the value
iteration in each region sequentially, hence the size of the MTBDD graph is smaller. There is a slowdown
when MLA is applied; however, time is not the bottleneck in symbolic model-checking tools like PRISM.
Case studies that PRISM cannot handle mostly fail due to excessive memory requirements, not due to
time. It is also clear, from the sizes of the MDPs in the table, that the symbolic version of MLA is able to
handle MDPs considerably larger than was previously feasible for the explicit implementation of [8].

6 Conclusion 

In this paper we extend the MLA technique to solve MDPs and stochastic games with quantitative objec- 
tives. MLA is particularly well-suited to problems where there is a notion of locality in the state space, 
so that it is useful to cluster states based on values, even though their transition relations may not be sim- 
ilar. Many inventory, planning and control problems satisfy the locality property and would benefit from 
the MLA technique. In the setting of inventory, planning and control problems quantitative objectives are 
more appropriate than qualitative objectives. We present MLA-based abstraction-refinement algorithms
for both stochastic games and MDPs with discounted and long-run average objectives. To demonstrate
the applicability of our algorithms, we present a symbolic implementation in PRISM for MDPs with
discounted objectives. Our experimental results show that the MLA-based technique gives significant
space savings over value-iteration methods.
Appendix

Proof (of Theorem 3). Since $f_-$ and $f_+$ are bounded and monotonic, it follows that there exist fixpoints
$v^*_-$ and $v^*_+$ of $f_-$ and $f_+$, respectively. Since $f$ is bounded by $f_-$ and $f_+$, it follows that
the fixpoint $v^*$ of $f$ is bounded by $v^*_-$ and $v^*_+$, i.e., $v^*_- \leq v^* \leq v^*_+$. The desired
result follows. ∎

Proof (of Lemma 1). The properties are straightforward to verify using Definition 3. ∎

Proof (of Lemma 2). It follows from Lemma 1 that $f_-$ and $f_+$ satisfy the properties of the functions
$f_-$ and $f_+$ of Theorem 3, respectively. The result then follows from Theorem 3. ∎

Proof (of Lemma 3). The result is easy to verify using Definition 3 and Definition 4. ∎

Proof (of Theorem 4 and Theorem 5). Both proofs are based on the fact that as $\varepsilon_{abs}$ and
$\varepsilon_{float}$ converge to 0, the output of the MLA algorithm converges to the value of the game. ∎

Proof (of Lemma 5). It follows from the definition that for all valuations $v$ we have

$$\mathit{MPre}(\min, v, R) \leq \mathit{Pre}(v) \leq \mathit{MPre}(\max, v, R).$$

The result follows from the above inequalities and the results of Lemma 4. ∎
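
The induction behind this argument can be written out as follows. This is a sketch under one assumption that is not restated in this appendix, namely that $\mathit{Pre}$ and $\mathit{MPre}(h, \cdot, R)$ are monotone in the valuation argument; the sequences $(v_i)$, $(v_i^-)$ and $(v_i^+)$ are those of Lemma 4 and Lemma 5, and inequalities between valuations are pointwise.

```latex
% Base case: all three sequences start from the constant valuation c.
% Inductive step: assume v_i^- \le v_i \le v_i^+.
\begin{align*}
v_0^- &= v_0 = v_0^+ = c, \\
v_{i+1}^- &= r - c + \mathit{MPre}(\min, v_i^-, R)
  \le r - c + \mathit{MPre}(\min, v_i, R)
  \le r - c + \mathit{Pre}(v_i) = v_{i+1}, \\
v_{i+1} &= r - c + \mathit{Pre}(v_i)
  \le r - c + \mathit{MPre}(\max, v_i, R)
  \le r - c + \mathit{MPre}(\max, v_i^+, R) = v_{i+1}^+ .
\end{align*}
```

Hence $v_i^- \leq v_i \leq v_i^+$ for all $i \geq 0$. If $(v_i^-)$ diverges to $+\infty$ then so does $(v_i)$, and Lemma 4 gives $c < v^*$; symmetrically, if $(v_i^+)$ diverges to $-\infty$ then so does $(v_i)$, and Lemma 4 gives $c > v^*$.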



